The human genome sequence database contains DNA sequences very like those of mycoplasma
molds. It appears such moulds not only infect molecular biology laboratories but were also picked
up by experimenters from contaminated samples and inserted into GenBank as if they were
human. At least one mouldy EST (Expressed Sequence Tag) has transferred from public
databases to commercial tools (Affymetrix HG-U133 plus 2.0 microarrays). We report a second
example (DA466599) and suggest there is a need to clean up genomic databases but fear
current tools will be inadequate to catch genes which have jumped the silicon barrier.
1 Introduction

Ensuring databases are both up to date and contain only correct data is a huge software engineering
problem. Even as the human genome was first published, the associated problems of cleansing
bioinformatics sequence data were being discussed 1213, but it appears only technical problems were considered.

We discovered that the definitive publicly accessible database holding the human DNA sequence has been 
corrupted in a surprising way. It contains the DNA sequence of a mold [3].

More recently we have discovered a second sequence in the human genome which is probably not human.
It appears that the time is ripe for a thorough check of the NCBI GenBank database.

It appears that not only has the human DNA sequence been "completely sequenced" [1] but in the process
other living organisms commonly found in molecular biology laboratories have infected not just the
physical samples but also the virtual in silico bioinformatics environment. By unwittingly using a technique
reminiscent of computer hacking, a mold gene has succeeded not just in moving within its own genome EI,
nor only in jumping horizontally and crossing the species barrier [5], but in crossing the silicon barrier
between life and data and reproducing itself across very diverse information-based media. Given
the highly interconnected nature of genomic research, technology and medicine and the low priority so far
attached to the problem, it is unlikely current data warehouse cleansing techniques will be able to eradicate
this and potentially other silicon jumping genes.

2 Computational in silico Experiment 

The anomalous HG-U133 +2 sequence (GenBank AF241217, probeset 1570561_at) we had previously
reported [3] was run against the human genome using Blast [6] at the European Bioinformatics Institute
(EMBL-EBI) with their default settings. This gave a list of published DNA sequences which partially
match the query. The list is ordered by blastn so that the best matches are at the top. Only the
top 50 fuzzy matches are included in the list. As expected the first match is the query sequence itself
(EM_HTG:AF241217). Despite [3] having been published more than a year ago, EM_HTG:AF241217 is
still described as "Homo sapiens". All the others are mycoplasma, except the 34th in the list, DA466599,
which EBI says is human. (EBI gives one reference for DA466599: [7].) However we suggest that
DA466599 may not be a human DNA sequence but is another example of physical contamination leading
to virtual infection of the public data.

We ran a second EBI blastn query (again using the NCBI em_rel database), this time looking for DNA
sequences that match DA466599. The results for DA466599 are similar to those for AF241217 and so
support the view that DNA sequence DA466599 is not human but instead is also a contamination. Again 
the best 50 matches were reported. Of course the first one is DA466599 itself. All the other matches 
returned by blastn are for various species of mycoplasma. 
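
The inspection of the two hit lists can be sketched in code. The following is only a minimal illustration, not the procedure we actually used: it assumes each blastn hit has already been reduced to an (accession, description) pair, and the function names, the simple majority rule and the placeholder hit list are all hypothetical. Only the accessions AF241217 and DA466599 are taken from the text.

```python
from collections import Counter


def classify_hit(description):
    """Return a coarse organism label from a hit's free-text description."""
    text = description.lower()
    if "mycoplasma" in text:
        return "mycoplasma"
    if "homo sapiens" in text or "human" in text:
        return "human"
    return "other"


def suspicious_hits(query_id, hits):
    """Return accessions labelled human among mostly mycoplasma hits.

    hits is a best-first list of (accession, description) pairs. The query's
    own self-match is ignored. The majority rule is an assumption made for
    this sketch, not a published criterion.
    """
    others = [(acc, classify_hit(desc)) for acc, desc in hits if acc != query_id]
    counts = Counter(label for _, label in others)
    if counts["mycoplasma"] <= len(others) / 2:
        return []
    return [acc for acc, label in others if label == "human"]


# Placeholder hit list for illustration only; this is NOT real BLAST output.
example_hits = [
    ("AF241217", "Homo sapiens (self-match, placeholder description)"),
    ("PLACEHOLDER_1", "Mycoplasma sp. (placeholder entry)"),
    ("PLACEHOLDER_2", "Mycoplasma sp. (placeholder entry)"),
    ("DA466599", "Homo sapiens (placeholder description)"),
    ("PLACEHOLDER_3", "Mycoplasma sp. (placeholder entry)"),
]

print(suspicious_hits("AF241217", example_hits))  # ['DA466599']
```

Running the same function with DA466599 as the query mirrors the second blastn search: if every hit other than the self-match is mycoplasma, no further human-labelled sequence is flagged, but the query's own "human" annotation remains in doubt.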

3 Discussion 

It is well known that mycoplasma contamination is rife in molecular biology laboratories [8]. Many labs
are routinely sterilised to counter it. Miller et al. [8] said mycoplasma contamination has
"potentially major consequences for the diagnosis and characterization of diseases using expression array
technology." Nonetheless, using RNAnet, we previously estimated about 1% of published data in the Gene
Expression Omnibus (GEO) database at NCBI (www.ncbi.nlm.nih.gov/geo) are contaminated 0.

One potential fortuitous side effect of the in silico spread of mycoplasma contamination is that the 
Affymetrix HG-U133 +2 1570561_at probeset might be used to indicate physical sample contamination.
Thus probeset 1570561_at could be treated as a free additional quality control signal. If 1570561_at says
there is significant expression of mycoplasma genes, then the sample is probably contaminated and the
other gene expression levels given by the microarray are suspect. 
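
As a rough sketch of how such a quality control signal might be applied, the code below flags samples in which probeset 1570561_at exceeds a cut-off. It is an assumption-laden illustration: the data layout (a dictionary of per-sample probeset values), the function name, the threshold and the example values are all hypothetical, and what counts as "significant expression" would need to be calibrated on real arrays.

```python
CONTAMINATION_PROBESET = "1570561_at"  # probeset identified in the text


def flag_contaminated(samples, threshold):
    """Return sorted names of samples whose contamination probeset is high.

    samples maps a sample name to a dict of probeset id -> expression value.
    threshold is a hypothetical cut-off chosen by the user. Samples lacking
    the probeset are not flagged.
    """
    flagged = []
    for name, expression in samples.items():
        value = expression.get(CONTAMINATION_PROBESET)
        if value is not None and value > threshold:
            flagged.append(name)
    return sorted(flagged)


# Made-up expression values for illustration only; not measured data.
example_samples = {
    "sample_a": {"1570561_at": 12.0, "hypothetical_probe": 3.0},
    "sample_b": {"1570561_at": 2.0, "hypothetical_probe": 9.0},
    "sample_c": {"hypothetical_probe": 5.0},
}

print(flag_contaminated(example_samples, threshold=8.0))  # ['sample_a']
```

Any sample returned by this check would be treated as probably contaminated, and its other expression values handled with caution.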

Having found two suspect DNA sequences it seems likely the published "human genome" sequence
contains more. Indeed contamination of all organism sequences seems possible. With the explosive growth
of genomic sequence data available via the Internet, including data from the 1000 Genomes Project O, it
seems time to look again at genomic database quality. 